8 research outputs found

A Word Counting Graph

Author: Furletova Eugenia
Kirakossian Zara
Regnier Mireille
Roytberg Mikhail
Publication venue: London College Publications
Publication date: 01/06/2009

We study methods for counting occurrences of words from a given set H over an alphabet V in a given text. All words have the same length m. Our goal is to compute the probability of finding p occurrences of words from H in a random text of size n, assuming that the text is generated by a Bernoulli or Markov model. We have designed an algorithm that solves the problem; it relies on traversals of a graph whose set of vertices is associated with the overlaps of words from H. Edges define two oriented subgraphs that can be interpreted as equivalence relations on words of H. Let P(H) be the set of equivalence classes and S be the set of other vertices. For the Bernoulli model, the run time is O(np(|P(H)| + |S|)) and the space complexity is O(pm|S| + |P(H)|). In a Markov model of order K, the additional space complexity is O(pm|V|^K) and the additional time complexity is O(npm|V|^K). Our preprocessing uses a variant of the Aho-Corasick automaton and achieves O(m|H|) time complexity. Our algorithm is implemented and provides a significant space improvement in practice. We compare its complexity to the additional improvement due to Aho-Corasick minimization.
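As a point of reference for the quantity described in this abstract, the following is a minimal sketch of a naive baseline, not the overlap-graph algorithm of the paper. It computes the probability of exactly k occurrences of words from H in a Bernoulli text of length n by dynamic programming over the states of a plain Aho-Corasick-style trie. The function name `occurrence_distribution` and its parameters are hypothetical and chosen only for illustration.

```python
def occurrence_distribution(words, probs, n, p_max):
    """Return [P(exactly k occurrences) for k in 0..p_max].

    words : set of words, all of the same length m (the set H)
    probs : dict letter -> probability (Bernoulli model over the alphabet V)
    n     : text length
    Naive baseline for illustration; overlapping occurrences are counted.
    """
    words = set(words)
    m = len(next(iter(words)))
    assert all(len(w) == m for w in words), "all words must have length m"
    # States are all prefixes of words in H (the trie nodes).
    prefixes = {w[:i] for w in words for i in range(m + 1)}

    def step(state, ch):
        # Longest suffix of state+ch that is a prefix of some word in H.
        s = state + ch
        while s not in prefixes:
            s = s[1:]
        return s

    dist = {("", 0): 1.0}  # (state, occurrences so far) -> probability
    for _ in range(n):
        nxt = {}
        for (state, k), pr in dist.items():
            for ch, q in probs.items():
                s = step(state, ch)
                # Equal word lengths: at most one occurrence ends here.
                k2 = k + (1 if s in words else 0)
                if k2 > p_max:
                    continue  # mass above p_max is not tracked
                nxt[(s, k2)] = nxt.get((s, k2), 0.0) + pr * q
        dist = nxt

    out = [0.0] * (p_max + 1)
    for (_, k), pr in dist.items():
        out[k] += pr
    return out
```

This baseline keeps one probability per (trie state, count) pair, so its memory grows with the total number of prefixes of H; the abstract's claimed space improvement is relative to automaton-based approaches of this kind.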

HAL-CentraleSupelec

INRIA a CCSD electronic archive server

HAL-Polytechnique

HAL-Rennes 1

Efficient seeding techniques for protein similarity search

Author: Furletova Eugenia
Gambin Anna
Kucherov Gregory
Lasota Slawomir
Noé Laurent
Roytberg Mikhail
Szczurek Ewa
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2008

We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several different design methods and use them to construct several alphabets. We then perform an analysis of seeds built over those alphabets and compare them with the standard Blastp seeding method [2,3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seeds is less expressive (but less costly to implement) than the accumulative principle used in Blastp and vector seeds, our seeds show a similar or even better performance than Blastp on Bernoulli models of proteins compatible with the common BLOSUM62 matrix.

arXiv.org e-Print Archive

CiteSeerX

HAL - Lille 3

Crossref

INRIA a CCSD electronic archive server

Efficient seeding techniques for protein similarity search

Author: Roytberg Mikhail
Gambin Anna
Noé Laurent
Lasota Slawomir
Furletova Eugenia
Szczurek Ewa
Kucherov Gregory
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2008

arXiv.org e-Print Archive

HAL - Lille 3

Crossref

INRIA a CCSD electronic archive server

On subset seeds for protein alignment

Author: Furletova Eugenia
Gambin Anna
Kucherov Gregory
Lasota Slawomir
Noé Laurent
Roytberg Mikhail A.
Szczurek Ewa
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2009

We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several different design methods and use them to construct several alphabets. We then perform a comparative analysis of seeds built over those alphabets and compare them with the standard BLASTP seeding method [2], [3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seeds is less expressive (but less costly to implement) than the cumulative principle used in BLASTP and vector seeds, our seeds show a similar or even better performance than BLASTP on Bernoulli models of proteins compatible with the common BLOSUM62 matrix. Finally, we perform a large-scale benchmarking of our seeds against several main databases of protein alignments. Here again, the results show a comparable or better performance of our seeds vs. BLASTP. Comment: IEEE/ACM Transactions on Computational Biology and Bioinformatics (2009)
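To illustrate why seeds of this kind are cheap to implement, here is a minimal sketch under the assumption that each seed letter is a partition of the amino-acid alphabet, so that two residues match at a position exactly when they fall in the same block. The groupings in `GROUPS`, the example seed, and the function names are hypothetical and are not the alphabets designed in the paper.

```python
from collections import defaultdict

AMINO = "ACDEFGHIKLMNPQRSTVWY"

# Three illustrative seed letters, each mapping a residue to its block key.
EXACT = {a: a for a in AMINO}                 # every residue is its own block
GROUPS = ["ILVM", "FYW", "KRH", "DE", "NQ", "ST", "AG", "C", "P"]  # hypothetical
GROUPED = {a: g for g in GROUPS for a in g}   # residues match within a group
ANY = {a: "*" for a in AMINO}                 # don't-care position


def seed_key(seed, window):
    """Hash key of a window: the tuple of block keys, one per seed position."""
    return tuple(letter[res] for letter, res in zip(seed, window))


def find_hits(seed, query, subject):
    """Return (i, j) pairs where query[i:] and subject[j:] match the seed."""
    span = len(seed)
    index = defaultdict(list)
    for j in range(len(subject) - span + 1):
        index[seed_key(seed, subject[j:j + span])].append(j)
    hits = []
    for i in range(len(query) - span + 1):
        for j in index.get(seed_key(seed, query[i:i + span]), []):
            hits.append((i, j))
    return hits


EXAMPLE_SEED = [EXACT, GROUPED, ANY, EXACT]
```

Because a hit is an equality of keys, the subject can be indexed once in a hash table and each query window looked up directly, with no per-pair score accumulation.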

arXiv.org e-Print Archive

CiteSeerX

HAL - Lille 3

Crossref

INRIA a CCSD electronic archive server

MPG.PuRe

Minimized Compact Automaton for Clumps over Degenerate Patterns

Author: Furletova Eugenia
Holub Jan
Regnier Mireille
Publication venue: HAL CCSD
Publication date: 24/01/2019

Clumps are sequences of overlapping occurrences of a given pattern that play a vital role in the study of the distribution of pattern occurrences. These distributions are used for finding functional fragments in biological sequences. In this paper we present a minimized compacted automaton (Overlap walking automaton, OWA) recognizing all the possible clumps for degenerate patterns and its usage for computation of probabilities of sets of clumps. We also present an Aho-Corasick-like automaton, RMinPatAut, recognizing all the sequences ending with pattern occurrences. The states of RMinPatAut are equivalence classes on the prefixes of the pattern words. We have proved that RMinPatAut is Nerode-minimal, i.e., minimal in the classical sense. We use RMinPatAut as an auxiliary structure for OWA construction. For a degenerate pattern, RMinPatAut can be constructed in time linear in the number of its states (it is bounded by 2m, where m is the length of pattern words). OWA can be constructed in time linear in the sum of its size and the RMinPatAut size.
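To make the notion of a clump concrete, the following minimal sketch finds clumps by direct scanning, not with the OWA or RMinPatAut automata of the paper. It assumes a degenerate pattern is given as a list of allowed-letter sets, one per position, and that two occurrences belong to the same clump when they share at least one text position. The function name `find_clumps` and the example pattern are hypothetical.

```python
def find_clumps(pattern, text):
    """Return clumps as (start, end_exclusive, occurrence_count) tuples.

    pattern : list of sets/strings of allowed letters, one per position
    text    : string to scan
    Direct O(len(text) * m) scan for illustration only.
    """
    m = len(pattern)
    # Start positions of all occurrences of the degenerate pattern.
    starts = [
        i for i in range(len(text) - m + 1)
        if all(text[i + k] in pattern[k] for k in range(m))
    ]
    clumps = []
    for s in starts:
        if clumps and s < clumps[-1][1]:
            # Overlaps the last occurrence of the current clump: extend it.
            start, _, count = clumps[-1]
            clumps[-1] = (start, s + m, count + 1)
        else:
            # No overlap: a new clump begins here.
            clumps.append((s, s + m, 1))
    return clumps


# Example: the pattern A[AC]A matches the words AAA and ACA.
EXAMPLE_PATTERN = ["A", "AC", "A"]
```

For the text "AACACAGAAA", occurrences start at positions 1, 3 and 7; the first two overlap and form one clump, while the third stands alone.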

INRIA a CCSD electronic archive server

HAL-Polytechnique